Statistical properties of the taxonomic classification of human languages are studied. It 
is shown that, at the highest levels of the taxonomic hierarchy, the frequency of taxon 
members as a function of the number of languages belonging to each member decays as 
a power law. This feature reveals that a self-similar structure underlies the taxonomy of 
languages, exactly as observed in the taxonomic classification of biological species. Such 
an analogy is a clue to the evolutionary foundation of language classification based on 
long-range comparison. 



1. Introduction 

Comparative linguistics shows that human languages can be grouped in a hierarchy 
of families whose members share a certain level of similarity, much like biological 
species in the taxonomic tree. This hierarchy is defined in terms of mutual relat- 
edness and affinity of languages in their present form, but also takes into account 
evolutionary aspects such as common innovations. Ruhlen compiled data for 
the almost 5,000 extant languages and proposed a taxonomic classification which, 
at the highest level, consists of 17 families. Some instances of these large families 
are the Indo-Hittite, which contains all Indo-European languages and is the largest 
in number of speakers; the Austric, which covers parts of South-Eastern Asia and 
Oceania and is the richest in number of languages; and the Amerind, which was 
one of the latest to be recognized as a family [1]. These families are divided into 
primary branches, which in turn contain groups, subgroups, branches, and so on. 
Along certain particularly rich branches (e.g. Bantu, in Africa) Ruhlen's classifica- 
tion distinguishes up to 17 hierarchical levels or taxa. 

The methods of long-range comparison that make possible the identification of 
language families at the highest taxonomic levels have been emphatically criticized 
by many linguists. These authors claim that an upper bound of 6,000-8,000 
years exists for the time elapsed from the separation of two languages from a common 
ancestor such that any connection between them can be established by comparison. 
Ruhlen's work, however, has received strong support from a field outside 
linguistics, namely genetics, through detailed studies of the genetic distance 
between human populations by Cavalli-Sforza and coworkers. These authors have 
shown that, at the highest levels, the taxonomy of populations is remarkably simi- 
lar to that of languages. The similarity can be traced up to levels corresponding to 
the main population expansions towards Eastern Asia, Oceania, and the Americas, 
some tens of thousands of years ago. 

In this paper, statistical regularities of the taxonomic classification of languages 
at the highest taxa are disclosed. The hierarchical distribution of languages is shown 
to exhibit self-similarity properties, which could hardly be explained if such distri- 
bution were derived from a baseless method. Comparison with the case of biological 
species, in fact, supports an evolutionary basis for the classification of languages. 

2. Analysis 

Our statistical analysis proceeds as follows. We choose a specific taxon of the hierarchy 
(say, primary branches). For each member $i$ of that taxon (say, Indo-European) 
we determine the number $n_i$ of extant languages belonging to that member (for 
Indo-European, $n_i = 144$). The set of values $n_i$ obtained for the selected taxon is 
then used to construct a histogram. The height of each column in the histogram is 
proportional to the fraction of members whose number of languages lies within the 
interval covered by the column, normalized by the column width. In other words, 
it gives the frequency $f(n)$ of taxon members which contain a given number $n$ of 
languages. 
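The histogram construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `width_normalized_histogram`, the choice of logarithmically spaced columns, and the sample values are hypothetical and are not taken from the paper, which specifies only that each column is normalized by its width.

```python
from typing import List, Tuple


def width_normalized_histogram(
    counts: List[int], n_bins: int
) -> List[Tuple[float, float, float]]:
    """Return (left edge, right edge, frequency) for each histogram column.

    counts holds the number of languages n_i of every member of one taxon.
    The frequency of a column is the fraction of members falling in it,
    divided by the column width.
    """
    if not counts or min(counts) <= 0 or n_bins < 1:
        raise ValueError("need positive counts and at least one bin")
    lo, hi = min(counts), max(counts)
    # Assumption of this sketch: edges equally spaced in log(n).
    # The last edge sits slightly above hi so that hi falls inside the last bin.
    ratio = hi * 1.0001 / lo
    edges = [lo * ratio ** (k / n_bins) for k in range(n_bins + 1)]
    total = len(counts)
    columns = []
    for left, right in zip(edges[:-1], edges[1:]):
        members = sum(1 for n in counts if left <= n < right)
        columns.append((left, right, members / total / (right - left)))
    return columns


# Made-up sample of n_i values, for illustration only.
sample = [1, 2, 2, 3, 5, 8, 13, 40, 144, 300]
for left, right, freq in width_normalized_histogram(sample, 4):
    print(f"[{left:7.2f}, {right:7.2f})  f = {freq:.5f}")
```

Dividing by the column width makes the columns comparable when their widths differ, so that the frequencies multiplied by the widths sum to one.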

Table 1. Statistical parameters of the taxonomic classification. The exponents $\gamma$ 
and $\nu$ characterize the power-law dependence of the frequency of members of a given 
taxon on the number of languages and the number of members in the successive taxon, 
respectively. The regression coefficient measures the quality of the least-squares 
fitting from which $\gamma$ is obtained. 

taxon     exponent $\gamma$    regression coefficient    exponent $\nu$
first     1.0 ± 0.2            -0.903                    1.0 ± 0.2
second    1.4 ± 0.1            -0.976                    1.7 ± 0.1
third     1.6 ± 0.1            -0.990                    1.7 ± 0.1
fourth    1.9 ± 0.1            -0.993                    1.9 ± 0.1
fifth     2.1 ± 0.1            -0.998




Results for the five highest taxa (families, primary branches, groups, subgroups, 
and branches) are presented in Fig. 1. For clarity, the histograms are displayed 
as sets of points, and the frequencies corresponding to each set are expressed 
in arbitrary units. In all cases, the data exhibit a regime of well-defined power-law 
decay, $f(n) \sim n^{-\gamma}$, spanning more than two decades in the number of languages, 
typically from $n \approx 2$ to $n \approx 300$, and three to four decades in frequencies. The 
exponent $\gamma$, obtained from linear least-squares fitting on the log-log plot, is given for 
each set in Table 1. The fittings are shown in Fig. 1 as straight lines. As a measure 
of the fitting quality, the regression coefficients are also given in Table 1. They are 
always above 0.9 (in modulus). 
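A fit of this kind can be sketched as follows. The helper name `loglog_fit` and the synthetic data are hypothetical. Reading the regression coefficient as the Pearson correlation between $\log n$ and $\log f$ is an assumption of this sketch, suggested by the negative values with modulus below one quoted in Table 1.

```python
import math
from typing import Sequence, Tuple


def loglog_fit(ns: Sequence[float], fs: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares straight line through the points (log10 n, log10 f).

    Returns (slope, intercept, r). For f(n) ~ n**(-gamma) the slope is -gamma.
    r is the Pearson correlation coefficient of the log-transformed data
    (assumed here to be what the text calls the regression coefficient).
    """
    xs = [math.log10(n) for n in ns]
    ys = [math.log10(f) for f in fs]
    k = len(xs)
    mean_x, mean_y = sum(xs) / k, sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Synthetic, noise-free power law with gamma = 1.6 (illustrative values only).
ns = [2, 5, 10, 30, 100, 300]
fs = [n ** -1.6 for n in ns]
slope, intercept, r = loglog_fit(ns, fs)
gamma = -slope
print(f"gamma = {gamma:.3f}, regression coefficient = {r:.3f}")
```

For noise-free data the recovered exponent equals the input value and the coefficient is $-1$. With real histograms the fit would be restricted to the interval where the power-law regime holds.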




[Figure 1: log-log plot; horizontal axis: number of languages, from 1 to 1000.] 



Fig. 1. Frequency of taxon members, shown as a function of the number of languages belonging 
to each member, for the first five taxonomic levels (families, primary branches, groups, subgroups, 
and branches). For clarity in displaying, the frequencies of each set have been multiplied by an 
appropriate constant. The lines correspond to least-square fittings in the intervals where they are 
plotted. 

The regression coefficients show that the definiteness of the power-law dependence 
improves for lower taxa. This is also apparent from Fig. 1, where the linear 
approximation is relatively poorer for the first taxa. We ascribe this effect to the 
fact that the number of members of a given taxon decreases considerably as higher 
taxa are considered. For the first taxon, in fact, the (8-column) histogram is 
constructed from a set of only 17 values. In this specific case, it is more reliable to 
study the distribution of languages in families using a rank plot. The rank $r$ of 
a family is given by its place in a list where families are sorted in decreasing order 
by the number of languages belonging to them ($r = 1$ for the richest family, $r = 2$ 
for the second richest, and so on). The rank plot displays the number of languages 
as a function of $r$, as shown in Fig. 2. In this linear-log plot the straight line stands 
for an exponential decay, $n \sim \exp(-\alpha r)$, and can be shown to correspond to a 
frequency of the form $f(n) \sim n^{-1}$. These data are reasonably well approximated 
by a linear fit (regression coefficient $= -0.988$), which is in full agreement with the 
corresponding power-law exponent $\gamma = 1.0 \pm 0.2$, quoted in Table 1. 
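The correspondence between an exponential rank plot and a frequency decaying as $n^{-1}$ can be sketched as follows. The argument assumes that the rank is treated as a continuous variable and that the rank of a family with $n$ languages counts the families at least as rich as it, so that it is proportional to the cumulative frequency above $n$. The decay rate is written $\alpha$ and the prefactor $A$ is an arbitrary constant introduced here for illustration.

```latex
n(r) = A\, e^{-\alpha r}
\;\Longrightarrow\;
r(n) = \frac{1}{\alpha} \ln\frac{A}{n},
\qquad
r(n) \propto \int_n^{\infty} f(n')\, dn'
\;\Longrightarrow\;
f(n) \propto -\frac{dr}{dn} = \frac{1}{\alpha n} \sim n^{-1}.
```

A straight line in the linear-log rank plot is therefore the signature of the exponent $\gamma = 1$.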




[Figure 2: linear-log plot; horizontal axis: rank, with tick marks at 5, 10, 15, and 20.] 

Fig. 2. Rank plot of the first taxon (families). The number of languages belonging to each member 
is plotted as a function of its rank. The straight line corresponds to an exponential decay and has 
been determined by least-square fitting. 

The fact that the frequency $f(n)$ exhibits power-law dependence for several 
taxa implies another important statistical property of the taxonomic classification. 
Consider two consecutive taxa $t$ and $t+1$ with frequencies $f_t(n) \sim n^{-\gamma}$ and 
$f_{t+1}(n) \sim n^{-\gamma'}$, respectively, and call $p(m)$ the fraction of members of taxon $t$ 
that contain a number $m$ of members of the successive taxon $t+1$. These three 
distributions are related through the expression 

$$f_t(n) = \sum_m p(m)\, f_{t+1}(n) \circ f_{t+1}(n) \circ \cdots \circ f_{t+1}(n), \tag{2.1}$$

where the $m$-th term involves the $m$-fold discrete convolution of $f_{t+1}$ with itself. In 
the Laplace domain, this relation reads 

$$\phi_t(s) = \sum_m p(m)\, [\phi_{t+1}(s)]^m \approx \int p(\mu)\, [\phi_{t+1}(s)]^{\mu}\, d\mu, \tag{2.2}$$

where $\phi_t$ and $\phi_{t+1}$ are the Laplace transforms of $f_t$ and $f_{t+1}$, respectively. In the 
right-hand side of Eq. (2.2), the variable $\mu$ replaces the summation index $m$ to 
produce a continuous approximation to $\phi_t$. The power-law decay of $f_t(n)$ implies 
that, near $s = 0$, its Laplace transform behaves as $\phi_t(s) \sim \exp(-a|s|^{\gamma-1})$ for 
$1 < \gamma < 2$ and as $\phi_t(s) \sim \exp(-bs - c|s|^{\gamma-1})$ for $2 < \gamma < 3$, where $a$, $b$, and $c$ 
are constant coefficients. Analogous approximate expressions hold for $\phi_{t+1}(s)$. 
These asymptotic expressions satisfy the continuous approximation in Eq. (2.2) if 
the distribution $p(m)$ is in turn a power law for large $m$, $p(m) \sim m^{-\nu}$. The exponent 
$\nu$ is a function of $\gamma$ and $\gamma'$, namely, 

$$\nu = 1 + \frac{\gamma - 1}{\gamma' - 1} \tag{2.3}$$

if $1 < \gamma, \gamma' < 2$, and 

$$\nu = \gamma \tag{2.4}$$

if $1 < \gamma < 2 < \gamma'$. Its value for the first four taxa is also given in Table 1. 
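As a quick arithmetic check, the relations (2.3) and (2.4) can be evaluated on the exponents listed in Table 1. The sketch below reads Eq. (2.3) as $\nu = 1 + (\gamma - 1)/(\gamma' - 1)$, the form that reproduces the tabulated values. The function name `nu_from_gammas` is hypothetical, and accepting the boundary value $\gamma = 1$ (first taxon) is an assumption made so that the first row of the table can be included.

```python
def nu_from_gammas(gamma: float, gamma_next: float) -> float:
    """Exponent nu of p(m) ~ m**(-nu) from the exponents of two consecutive taxa.

    gamma is the exponent of taxon t, gamma_next that of taxon t + 1.
    """
    if not 1.0 <= gamma < 2.0:
        raise ValueError("this sketch covers only 1 <= gamma < 2")
    if 1.0 < gamma_next < 2.0:
        # Eq. (2.3): both exponents between 1 and 2
        return 1.0 + (gamma - 1.0) / (gamma_next - 1.0)
    if gamma_next > 2.0:
        # Eq. (2.4): gamma below 2 and gamma_next above 2
        return gamma
    raise ValueError("gamma_next must lie in (1, 2) or above 2")


# Central values of the exponent gamma from Table 1, first to fifth taxon.
table_gammas = [1.0, 1.4, 1.6, 1.9, 2.1]
nus = [
    round(nu_from_gammas(g, g_next), 1)
    for g, g_next in zip(table_gammas, table_gammas[1:])
]
print(nus)  # [1.0, 1.7, 1.7, 1.9]
```

The four values agree with the last column of Table 1. No value is obtained for the fifth taxon because the exponent of the sixth taxon is not listed.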



3. Discussion and conclusion 

Power-law frequency distributions are known to reveal self-similarity and fractal 
geometry in the underlying structures (in our case, the taxonomic tree). The 
interest of this statistical property of the taxonomic classification of languages 
resides in the fact that exactly the same feature is found in the taxonomy of biological 
species. The power-law dependence in the frequency of biological taxon abundance 
was first discussed by Yule and, much later, Burlando studied 
in detail the distribution of the exponent $\nu$ at different levels and along different 
branches of the taxonomic tree, also including some families of extinct species. In 
contrast with the case of languages, the exponent $\nu$ for biological taxonomy can 
be measured directly on the tree. Indeed, the biological taxonomic tree is very rich: 
it contains more than 1,500,000 species at the lowest level. Even at the highest 
taxa, one finds members with a large number of members from the successive 
taxon. On the other hand, there are fewer than 5,000 extant languages. The statistics 
are consequently much poorer, and the exponent $\nu$ is more reliably inferred from 
the values of $\gamma$ and $\gamma'$, as done above. It has been found that, for biological species, 
$\nu$ varies in a relatively narrow interval, $1.4 < \nu < 2.5$. Note that, except for the first 
taxon, the values of $\nu$ obtained for language taxonomy are also in that interval. 

Several models have been proposed to account for the statistical regularities 
in the taxon abundance of biological species, ranging from branching dynamical 
processes to simplified macroevolutionary models. Though these 
stylized models do reproduce the fractal-like structure of taxonomic trees, which 
suggests that they successfully capture the essential ingredients in the organization 
of biological taxa, they seldom give a quantitatively satisfactory explanation of such 
regularities. By now, however, there is little doubt that the power-law distributions 
found in taxon abundance are a consequence of the inherently complex mechanisms 
that drive biological macroevolution, giving rise to speciation and, more generally, 
originating new members at all the taxonomic levels. Self-similarity and fractal 
features, in fact, have been recognized as a clue to the underlying complexity in a 
large class of dynamical systems. The fact that the same kind of distribution is 
found in the taxonomy of languages strongly suggests that language classification 
reflects, even at its highest levels, the underlying evolutionary mechanisms. 

In summary, we have shown that, at the highest levels of the taxonomic classification 
of human languages, the frequency of members containing a given number of 
languages decays as a power law over at least two decades. These systematic 
regularities appear to rule out the possibility that the classification results from a 
baseless method. Moreover, they imply that the frequency of taxon members containing a 
given number of members from the successive taxon is also well described by a 
power-law distribution. The same property is found in the taxonomy of biological 
species, whose evolutionary origin is firmly established. Along with the genetic 
evidence provided by Cavalli-Sforza, this analogy supports an evolutionary basis for 
Ruhlen's classification. 